library(tidyverse)
library(knitr)
library(av)
library(gganimate)
library(broom)
library(kableExtra)
library(bibtex)

gdp_data <- read.csv("gdp_pcap.csv")
life_expectancy_data <- read.csv("sp_dyn_le00_in.csv")

gdp_long <- gdp_data |>
  select(country, X1960:X2022) |>
  pivot_longer(cols = X1960:X2022,
               names_to = "Year",
               values_to = "GDP per Capita") |>
  mutate(
    Year = as.integer(str_remove(Year, "^X")),
    `GDP per Capita` = case_when(
      str_detect(`GDP per Capita`, "k") ~ parse_number(`GDP per Capita`) * 1000,
      TRUE ~ parse_number(`GDP per Capita`)))
# Only selecting 1960-2022 to match life expectancy and not include predicted GDP after 2022

life_expectancy_long <- life_expectancy_data |>
  pivot_longer(cols = X1960:X2022,
               names_to = "Year",
               values_to = "Life Expectancy") |>
  mutate(Year = as.integer(str_remove(Year, "^X")))

combined_data <- gdp_long |>
  inner_join(life_expectancy_long, join_by(country, Year))

# Define country vectors by continent
# Used ChatGPT to format this (https://chatgpt.com/share/67c0f690-5020-8004-b149-ef2bae1b914e)
combined_data <- combined_data |>
mutate(
continent = factor(case_when(
country %in% c("Afghanistan", "Armenia", "Azerbaijan", "Bahrain", "Bangladesh",
"Bhutan", "Brunei", "Cambodia", "China", "Georgia", "Hong Kong, China", "India",
"Indonesia", "Iran", "Iraq", "Israel", "Japan", "Jordan",
"Kazakhstan", "Kuwait", "Kyrgyz Republic", "Lao", "Lebanon",
"Malaysia", "Maldives", "Mongolia", "Myanmar", "Nepal",
"North Korea", "Oman", "Pakistan", "Palestine", "Philippines",
"Qatar", "Saudi Arabia", "Singapore", "South Korea", "Sri Lanka",
"Syria", "Tajikistan", "Thailand", "Timor-Leste", "Turkey",
"Turkmenistan", "UAE", "Uzbekistan", "Vietnam", "Yemen") ~ "Asia",
country %in% c("Angola", "Algeria", "Benin", "Botswana", "Burkina Faso",
"Burundi", "Cape Verde", "Cameroon", "Central African Republic",
"Chad", "Comoros", "Congo, Dem. Rep.", "Congo, Rep.", "Cote d'Ivoire",
"Djibouti", "Egypt", "Equatorial Guinea", "Eritrea", "Eswatini",
"Ethiopia", "Gabon", "Gambia", "Ghana", "Guinea", "Guinea-Bissau",
"Ivory Coast", "Kenya", "Lesotho", "Liberia", "Libya",
"Madagascar", "Malawi", "Mali", "Mauritania", "Mauritius",
"Morocco", "Mozambique", "Namibia", "Niger", "Nigeria",
"Rwanda", "Sao Tome and Principe",
"Senegal", "Seychelles", "Sierra Leone", "Somalia", "South Africa",
"South Sudan", "Sudan", "Tanzania", "Togo", "Tunisia", "Uganda",
"Zambia", "Zimbabwe") ~ "Africa",
country %in% c("Albania", "Andorra", "Austria", "Belarus", "Belgium",
"Bosnia and Herzegovina", "Bulgaria", "Croatia", "Cyprus",
"Czech Republic", "Denmark", "Estonia", "Finland", "France",
"Germany", "Greece", "Hungary", "Iceland", "Ireland", "Italy",
"Latvia", "Liechtenstein", "Lithuania", "Luxembourg", "Malta",
"Moldova", "Monaco", "Montenegro", "Netherlands", "North Macedonia",
"Norway", "Poland", "Portugal", "Romania", "Russia", "San Marino",
"Serbia", "Slovak Republic", "Slovenia", "Spain", "Sweden", "Switzerland",
"Ukraine", "UK", "Vatican City") ~ "Europe",
country %in% c("Antigua and Barbuda", "Bahamas", "Barbados", "Belize",
"Canada", "Costa Rica", "Cuba", "Dominica",
"Dominican Republic", "El Salvador", "Grenada",
"Guatemala", "Haiti", "Honduras", "Jamaica", "Mexico", "Nicaragua",
"Panama", "St. Kitts and Nevis",
"St. Lucia", "St. Vincent and the Grenadines",
"Trinidad and Tobago", "USA") ~ "North America",
country %in% c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia",
"Ecuador", "Guyana", "Paraguay", "Peru", "Suriname",
"Uruguay", "Venezuela") ~ "South America",
# "Micronesia, Fed. Sts." belongs to Oceania, not North America
country %in% c("Australia", "Fiji", "Kiribati", "Marshall Islands",
"Micronesia", "Micronesia, Fed. Sts.", "Nauru", "New Zealand", "Palau",
"Papua New Guinea",
"Samoa", "Solomon Islands", "Tonga", "Tuvalu", "Vanuatu") ~ "Oceania",
TRUE ~ "Other" # For countries not in any of the continent lists
), levels = c("Asia", "Africa", "Europe", "North America", "South America", "Oceania", "Other"))
) |>
drop_na()

This analysis explores the relationship between GDP per capita and life expectancy to understand how a country’s wealth shapes the well-being of its people (Gapminder (2025b)). Studying these variables together shows how economic growth can translate into healthier, longer lives. We aim to answer several key questions: How does this relationship vary by continent? How has it changed over time? Is there a point where economic growth stops contributing to improved life expectancy, or does it continue to have a significant impact?
Many studies have examined the relationship between GDP and a country’s life expectancy, and their conclusions are broadly consistent: an increase in GDP per capita is associated with higher life expectancy. A panel data analysis in the International Journal of Health Sciences and Research quantifies this relationship, stating that “each additional $10,000 per capita per year increases life expectancy at birth by an average of 1.8 years” (Shafi and Fatima (2019)). Further research from Georgia Tech suggests that this growth follows a non-linear pattern, with diminishing returns at higher GDP levels, where additional increases no longer significantly impact life expectancy (Shah, Akram, and Alvi (2023)). Given these findings, our hypothesis aligns with Georgia Tech’s research: we expect GDP per capita to have a strong positive effect on life expectancy, but only up to a certain threshold, after which the impact levels off (Georgia Tech (2023)). The reasoning is that greater economic output gives people access to more resources and a healthier lifestyle, but biological limits on lifespan mean that life expectancy cannot keep rising indefinitely with income.
The GDP data set contains gross domestic product per person, adjusted for differences in purchasing power and expressed in international dollars fixed at 2017 prices. The international dollar is a virtual currency adjusted for Purchasing Power Parity (PPP): one international dollar buys a comparable amount of goods and services in any country as a U.S. dollar would buy in the United States, which makes values comparable across countries. GDP per capita is the gross domestic product divided by the population of the country, which gives a rough estimate of the average annual income of its citizens. The data set contains a “country” variable with the name of each country, and the GDP per capita of each country for every year from 1800 to 2022.
The Life Expectancy data set contains data on life expectancy at birth in total number of years. Life expectancy at birth can be defined as the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life. This data set also contains a “country” variable and the life expectancy at birth value for each country in the data set for every year from 1960 to 2022.
The most important phase of data cleaning was removing unwanted observations. After 2022, the GDP data set reported predicted values for GDP. We want our analysis to rely solely on observed values, so these years were removed. Additionally, the observations prior to 1960 were removed since no measurements for life expectancy were recorded prior to that date. Keeping the years 1960 to 2022 allowed us to have real data for both quantitative variables.
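The cleaning steps described above can be illustrated on a small made-up table. This is a minimal sketch in base R, not the code used for this report: the `toy` data frame, its values, and the helper `parse_k_suffix()` are hypothetical, and the sketch assumes that large values are stored as strings with a "k" suffix, as the parsing step in our pipeline implies.

```r
# Hypothetical helper: convert strings such as "1.2k" or "850" to numbers.
parse_k_suffix <- function(x) {
  x <- as.character(x)
  has_k <- grepl("k", x, fixed = TRUE)
  value <- as.numeric(sub("k", "", x, fixed = TRUE))
  ifelse(has_k, value * 1000, value)
}

# Made-up wide table: one column per year, as read.csv() would name them.
toy <- data.frame(
  country = c("A", "B"),
  X1959 = c("900", "1.1k"),   # before 1960: no life expectancy data
  X1960 = c("950", "1.2k"),
  X2022 = c("12.5k", "3400"),
  X2023 = c("13k", "3500")    # after 2022: projected values
)

# Keep only the year columns from 1960 to 2022.
year_cols <- grep("^X[0-9]{4}$", names(toy), value = TRUE)
years <- as.integer(sub("^X", "", year_cols))
keep <- year_cols[years >= 1960 & years <= 2022]

# Reshape to long format: one row per country and year.
toy_long <- do.call(rbind, lapply(keep, function(col) {
  data.frame(
    country = toy$country,
    Year = as.integer(sub("^X", "", col)),
    gdp_per_capita = parse_k_suffix(toy[[col]])
  )
}))

toy_long
```

Only the 1960 and 2022 columns survive the filter, and every value ends up as a plain number regardless of whether it was stored with a "k" suffix.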
The plots below illustrate the relationship between GDP per capita (on a log scale) and life expectancy across different continents. Plot 1 displays a point for each country and year within a given continent, with a red linear regression line fitted separately within each facet. Faceting by continent allows a comparison of how the relationship between GDP per capita and life expectancy varies across different regions of the world. Plot 2 places the continent-level regression lines on a single plot. It is important to note that GDP is recorded in international dollars fixed at 2017 prices.
ggplot(combined_data, aes(x = log10(`GDP per Capita`), y = `Life Expectancy`)) +
  geom_point(aes(color = continent), alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Plot 1: GDP and Life Expectancy Regression Faceted by Continent",
       subtitle = "GDP is in international dollars and fixed at 2017 prices",
       x = "Log10 of GDP per Capita",
       y = "Life Expectancy", color = "Continent",
       caption = "Figure 1") +
  theme_bw() +
  theme(strip.background = element_blank(), axis.text.x = element_text(size = 10)) +
  facet_wrap(~continent)
ggplot(combined_data, aes(x = log(`GDP per Capita`), y = `Life Expectancy`, color = continent)) +
  geom_smooth(method = "lm", aes(group = continent), se = FALSE) +
  labs(
    title = "Plot 2: GDP and Life Expectancy Regression by Continent",
    subtitle = "Estimated Regression Lines by Continent",
    x = "Natural Log of GDP per Capita",
    y = "Life Expectancy",
    color = "Continent",
    caption = "Figure 2"
  ) +
  theme_bw()

Plot 1 illustrates a moderately strong, positive relationship between logged GDP per capita and life expectancy across all continents, indicating that wealthier countries tend to have higher life expectancies. However, the strength of this relationship varies by region. For instance, South America and Africa exhibit steeper positive trends, while Oceania has a flatter slope. The plot also suggests diminishing returns, as the impact of increasing GDP on life expectancy appears more pronounced at lower income levels and flattens at higher GDP levels. A few possible outliers are visible, mostly observations from Africa and Asia with a higher life expectancy than their GDP would predict, but in general the linear relationship holds for most data points.
Plot 2 is similar to Plot 1 but allows more direct comparisons between regions. It also relates life expectancy to log GDP per capita for each region, but it does not display individual country values. With all fitted regression lines on the same plot, we can better identify the range of log GDP values, compare slopes, and evaluate similar trends. South America still shows the steepest increase in life expectancy per unit of log GDP, and we can also see that Europe and Africa have similar slopes.
This animated plot displays the average GDP per capita and average life expectancy for each region, plotted over time from 1960 to 2022. Because we are tracking two quantitative variables over time, year is on the x-axis, average GDP per capita is on the y-axis, and average life expectancy is encoded as point size. Each continental region is represented by its own color.
combined_data_avg <- combined_data |>
  group_by(Year, continent) |>
  summarise(
    avg_life_expectancy = mean(`Life Expectancy`, na.rm = TRUE),
    avg_gdp_per_capita = mean(`GDP per Capita`, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble
  )

p <- ggplot(combined_data_avg, aes(x = Year, y = avg_gdp_per_capita,
                                   size = avg_life_expectancy, color = continent)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(1, 5)) +
  labs(
    title = "Average GDP per Capita by Year with Bubble Size as Life Expectancy",
    subtitle = "Size of bubble represents average life expectancy",
    x = "Year",
    y = "Average GDP per Capita",
    color = "Continent",
    size = "Average Life Expectancy",
    caption = "Figure 3"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_color_brewer(palette = "Set2") +
  transition_time(Year) +
  shadow_mark() +     # Keeps all past years visible
  ease_aes('linear')  # Smooth transitions

# Animate the plot
animate(p, nframes = 100, fps = 10)

From this, we can see that, in general, both average GDP per capita and average life expectancy increase for all continents over time. Note that this graph uses raw rather than logged GDP values, which gives a better perspective on the absolute size of these changes. Europe consistently has the highest average GDP per capita along with high life expectancy values. Africa shows the largest gain in life expectancy over this period, yet its average GDP per capita remains very low. Europe also shows the largest increase in average GDP per capita over time, in addition to an increase in average life expectancy.
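To analyze the relationship between life expectancy and GDP per capita, we used a linear regression model. We regressed life expectancy on the natural log of GDP per capita to model the association between the two; a regression of this kind does not by itself establish causality. The log transformation was necessary because GDP per capita is strongly right-skewed and its relationship with life expectancy is non-linear on the raw scale; logging GDP makes the relationship approximately linear. Figure 4 shows the relationship between the transformed GDP and life expectancy as well as the resulting Ordinary Least Squares (OLS) fit. The intercept (\(\beta_{0}\)) is the predicted life expectancy when the log of GDP per capita is 0, that is, when GDP per capita is $1. The slope (\(\beta_{1}\)) is the predicted increase in life expectancy for a one-unit increase in the natural log of GDP per capita; equivalently, a 1% increase in GDP per capita corresponds to an increase of approximately \(\beta_{1}/100\) years.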
gdp_regression <- lm(`Life Expectancy` ~ log(`GDP per Capita`),
                     data = combined_data)
# The model used for the regression equation (natural log of GDP per capita)

# The plot uses log10 on the x-axis; the fitted line is equivalent, only the axis units differ
ggplot(combined_data, aes(x = log10(`GDP per Capita`),
                          y = `Life Expectancy`)) +
  geom_point(aes(color = continent),
             alpha = 0.7) +
  geom_smooth(method = "lm",
              color = "red",
              se = FALSE) +
  labs(title = "Relationship Between GDP and Life Expectancy\n(Linear Regression Model)",
       subtitle = "GDP is in international dollars and fixed at 2017 prices",
       x = "Log10 of GDP per Capita",
       y = "Life Expectancy",
       color = "Continent",
       caption = "Figure 4") +
  theme_bw() +
  theme(strip.background = element_blank(),
        axis.text.x = element_text(size = 10))

tidy(gdp_regression) |>
  kable(caption = "Linear Regression Equation Information")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -3.606519 | 0.4837107 | -7.455943 | 0 |
| log(GDP per Capita) | 7.668944 | 0.0543856 | 141.010679 | 0 |
glance(gdp_regression) |>
  kable(caption = "Linear Regression Fit Information")

| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6252438 | 0.6252124 | 6.950341 | 19884.01 | 0 | 1 | -40023.13 | 80052.27 | 80074.42 | 575725.7 | 11918 | 11920 |
For our linear regression model, we chose to use all the data we had in order to obtain the model that best represents the association between life expectancy and GDP per capita. GDP per capita is still logged, as explained above, so that the association is linear and fits the model well.
\[ \hat{y} = -3.607 + 7.669 \ln(x) \]
Here \(\hat{y}\) is the predicted life expectancy in years and \(x\) is GDP per capita. When the natural log of a country’s GDP per capita is 0 (a GDP per capita of $1), the country is estimated to have a life expectancy of -3.607 years; this is unrealistic, but a GDP per capita of $1 is far below any realistic value. Each time the natural log of GDP per capita increases by 1 (GDP per capita multiplied by about 2.72), the predicted life expectancy increases by 7.669 years.
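Because the predictor is logged, the slope is easier to read in terms of proportional changes in GDP per capita. The sketch below uses the rounded coefficients from the regression table; the function names `predict_life_exp()` and `effect_of_factor()` are ours, introduced only for illustration.

```r
# Coefficients rounded from the regression table in this report.
b0 <- -3.607
b1 <- 7.669

# Predicted life expectancy (years) at a given GDP per capita.
predict_life_exp <- function(gdp) b0 + b1 * log(gdp)

# Change in predicted life expectancy when GDP per capita
# is multiplied by `factor` (independent of the starting level).
effect_of_factor <- function(factor) b1 * log(factor)

effect_of_factor(1.01)    # a 1% increase
effect_of_factor(2)       # a doubling
effect_of_factor(exp(1))  # a one-unit increase in ln(GDP): equals b1

# The same doubling effect, computed from two predictions.
predict_life_exp(2000) - predict_life_exp(1000)
```

Under this model a 1% increase in GDP per capita corresponds to roughly 0.08 additional years of predicted life expectancy, while a doubling corresponds to roughly 5.3 years.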
To understand how well our OLS model explains the variability in life expectancy, we can examine how variance is distributed across observed values, predicted values, and residuals. The overall variance in life expectancy represents the total range of values across all observations. The variance in predicted values indicates the portion of this variation that is explained by GDP per capita, while the variance in residuals accounts for the unexplained differences. By analyzing these components, we can determine the proportion of variance captured by the model and evaluate its effectiveness in predicting life expectancy.
# Calculate variances
var_response <- var(combined_data$`Life Expectancy`, na.rm = TRUE)
var_fitted <- var(fitted(gdp_regression), na.rm = TRUE)
var_residuals <- var(residuals(gdp_regression), na.rm = TRUE)
# Create a dataframe for display
variance_table <- data.frame(
Statistic = c("Variance in Response", "Variance in Fitted Values", "Variance in Residuals"),
Value = c(var_response, var_fitted, var_residuals))
variance_table |>
kable(caption = "Variance in Regression Model") |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

| Statistic | Value |
|---|---|
| Variance in Response | 128.89231 |
| Variance in Fitted Values | 80.58912 |
| Variance in Residuals | 48.30319 |
The proportion of variability in life expectancy accounted for by the regression model is the variance in fitted values divided by the variance in the response: 80.59 / 128.89 ≈ 0.625. This matches the R-squared value in the Linear Regression Fit Information table. It tells us that 62.52% of the variability in life expectancy is explained by the model, indicating a moderate-to-strong relationship between GDP per capita and life expectancy. The remaining 37.48% of the variability (the variance in residuals, 48.30 / 128.89) is unexplained by the model, suggesting that other factors also influence life expectancy.
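The variance decomposition behind this calculation holds for any OLS model with an intercept. As a self-contained check, the sketch below repeats it on R's built-in `mtcars` data (an illustrative stand-in, not our data) and then redoes the arithmetic with the variances reported in the table above.

```r
# Illustrative model on a built-in data set (not the report's data).
fit <- lm(mpg ~ log(hp), data = mtcars)

var_response  <- var(mtcars$mpg)
var_fitted    <- var(fitted(fit))
var_residuals <- var(residuals(fit))

# For OLS with an intercept, the two pieces add up to the total variance.
all.equal(var_fitted + var_residuals, var_response)

# The ratio of fitted to total variance equals R-squared.
r2_from_variances <- var_fitted / var_response
all.equal(r2_from_variances, summary(fit)$r.squared)

# The same arithmetic with the variances reported in this report.
report_r2 <- 80.58912 / 128.89231
report_unexplained <- 48.30319 / 128.89231
round(c(explained = report_r2, unexplained = report_unexplained), 4)
```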
The following plots extend our analysis by incorporating simulated life expectancy values, which allows us to better understand the variability and predictive relationships in our dataset. Plot 1 revisits the relationship between GDP per capita (on a log scale) and observed life expectancy, whereas Plot 2 plots the same GDP values against simulated life expectancy, generated from our regression model’s predictions plus random error. Comparing the two helps assess whether data generated by the model resemble the observed data.
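# Predicted life expectancy for every observation, and the residual standard error
predict_gdp <- predict(gdp_regression)
estimated_sigma <- sigma(gdp_regression)

# Add normally distributed random error to a vector of predictions
rand_error <- function(x, mean = 0, sd){
  errors <- rnorm(n = length(x), mean = mean, sd = sd)
  y <- (x + errors)
  return(y)
}

set.seed(1234)
# Despite its name, sim_gdp holds simulated life expectancy values
sim_response <- tibble(sim_gdp = rand_error(predict_gdp,
                                            sd = estimated_sigma))

# Confirm there are no missing values before binding columns
map_int(combined_data, ~sum(is.na(.x)))

        country            Year  GDP per Capita Life Expectancy       continent
              0               0               0               0               0

data_with_predict <- combined_data |>
  bind_cols(sim_response)

#Graph 1 here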
ggplot(combined_data, aes(x = log10(`GDP per Capita`),
                          y = `Life Expectancy`)) +
  geom_point(aes(color = continent),
             alpha = 0.7) +
  geom_smooth(method = "lm",
              color = "red",
              se = FALSE) +
  labs(title = "Relationship Between GDP and Life Expectancy\n(Linear Regression Model)",
       subtitle = "GDP is in international dollars and fixed at 2017 prices",
       x = "Log10 of GDP per Capita",
       y = "Life Expectancy", color = "Continent",
       caption = "Figure 5") +
  theme_bw() +
  theme(strip.background = element_blank(),
        axis.text.x = element_text(size = 10))
#Graph 2 here
ggplot(data_with_predict,
       aes(x = log10(`GDP per Capita`),
           y = sim_gdp)) +  # sim_gdp holds simulated life expectancy
  geom_point(aes(color = continent),
             alpha = 0.7) +
  geom_smooth(method = "lm",
              color = "red",
              se = FALSE) +
  labs(title = "Relationship Between GDP and Simulated Life Expectancy",
       subtitle = "GDP is in international dollars and fixed at 2017 prices",
       x = "Log10 of GDP per Capita",
       y = "Simulated Life Expectancy",
       color = "Continent",
       caption = "Figure 6") +
  theme_bw() +
  theme(strip.background = element_blank(),
        axis.text.x = element_text(size = 10))

Plot 1 is included here once again to allow a side-by-side comparison between the real data and the simulated values. It revisits the relationship between GDP per capita (on a log scale) and life expectancy, reinforcing the overall positive correlation. Plot 2 shows what data generated by our regression model look like. The x-axis still represents log-transformed GDP per capita, but the y-axis now plots simulated life expectancy: the model’s prediction plus normally distributed random error whose standard deviation is the model’s residual standard error. The strong positive correlation (indicated by the red regression line) reflects the trend captured by the model, and the dispersion around the line comes from the introduced random variation. The distribution of points across continents along the x-axis matches Plot 1 because the GDP values are unchanged.
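The simulation step, prediction plus normal noise scaled by the residual standard error, can be sketched on a small built-in data set. This is an illustration only: `mtcars` stands in for our data, and base R's `simulate()` is shown as an alternative way to draw responses from a fitted `lm`, not as the method used in this report.

```r
# Illustrative model on a built-in data set (not the report's data).
fit <- lm(mpg ~ log(hp), data = mtcars)

# Manual approach, mirroring the steps described in the text:
# fitted values plus normal errors with sd equal to the residual standard error.
set.seed(1234)
manual_sim <- predict(fit) + rnorm(nrow(mtcars), mean = 0, sd = sigma(fit))

# Base R alternative: simulate() draws new responses from the fitted model
# and returns a data frame with one column per simulation.
built_in_sim <- simulate(fit, nsim = 1, seed = 1234)$sim_1

# Both vectors have one simulated response per observation.
head(data.frame(observed = mtcars$mpg, manual_sim, built_in_sim))
```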
Finally, to evaluate the variability of R-squared values in our regression model, we ran a simulation. This process involved generating multiple simulated versions of life expectancy from the fitted model, refitting the regression of each simulated response on log GDP per capita, and recording the resulting R-squared value. After simulating 1000 R-squared values, we plotted them in a histogram with an overlaid normal distribution curve.
# Returns one R-squared value for a simulated version of the data
get_r2 <- function(regression){
  predicted <- predict(regression)
  est_sigma <- sigma(regression)
  sim_response <- tibble(sim_response = rand_error(predicted,
                                                   sd = est_sigma))
  data_with_predict2 <- combined_data |>
    bind_cols(sim_response) |>
    mutate(`Log of GDP per Capita` = log(`GDP per Capita`))
  sim_regression <- lm(sim_response ~ `Log of GDP per Capita`,
                       data = data_with_predict2)
  r2_data <- glance(sim_regression)
  return(r2_data$r.squared)
}

# Taking 1000 samples
r2s <- map_dbl(.x = 1:1000,
               .f = ~get_r2(gdp_regression))

# Graphing the samples
tibble(r2s) |>
  ggplot(aes(x = r2s)) +
  geom_histogram(aes(y = after_stat(density))) +  # after_stat() replaces the deprecated ..density..
  stat_function(fun = ~dnorm(.x, mean = mean(r2s),
                             sd = sd(r2s)),
                col = "darkorchid",
                lwd = 2) +
  labs(x = "R squared values",
       y = "Density",
       title = "Sample Distribution of R Square Values",
       subtitle = "Simulated life expectancy regressed on log of GDP per Capita",
       caption = "Figure 7")

The highest point of the distribution is around 0.625, which aligns with the observed R-squared value of 0.63 from the original model. The relatively narrow spread suggests that the model consistently explains around 62-63% of the variance in life expectancy across simulated data sets, implying good reliability. The overlaid curve shows that the simulated R-squared values are approximately normally distributed.
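The R-squared simulation can be reproduced in miniature without our data. The sketch below is a hedged illustration on synthetic values: the sample size, coefficients, noise level, and the helper `simulate_r2()` are all made up for the example, and only 200 replicates are drawn to keep it fast.

```r
set.seed(1234)

# Synthetic stand-in for the report's data (all values are made up).
n <- 500
log_gdp  <- runif(n, min = 6, max = 11)
life_exp <- -3.6 + 7.7 * log_gdp + rnorm(n, sd = 7)
toy <- data.frame(log_gdp, life_exp)

fit <- lm(life_exp ~ log_gdp, data = toy)
observed_r2 <- summary(fit)$r.squared

# Hypothetical helper: simulate a response from the fitted model,
# refit the regression, and return its R-squared.
simulate_r2 <- function(model, data) {
  data$sim_y <- fitted(model) + rnorm(nrow(data), mean = 0, sd = sigma(model))
  summary(lm(sim_y ~ log_gdp, data = data))$r.squared
}

r2_sims <- replicate(200, simulate_r2(fit, toy))

# Compare the observed R-squared with the simulated distribution.
c(observed = observed_r2, mean_sim = mean(r2_sims), sd_sim = sd(r2_sims))
```

If the simulated values cluster tightly around the observed R-squared, as they do in Figure 7, the observed fit is typical of data generated by the model.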
This analysis aimed to determine whether a relationship exists between GDP per capita and life expectancy. We expected a positive correlation between them, meaning that as a country’s economy grows, its health outcomes, such as life expectancy, improve. Our findings agree with this and show a moderate-to-strong relationship between GDP per capita and life expectancy. For each one-unit increase in the natural log of GDP per capita (GDP per capita multiplied by about 2.72), predicted life expectancy increases by 7.669 years; a 1% increase in GDP per capita corresponds to roughly 0.08 years.
We assessed our model by using it to simulate life expectancy values and refitting the regression on the simulated data. The simulated R-squared values were approximately normally distributed and centered near the observed value of 0.625, which suggests that the observed data are consistent with the fitted model.
Our findings agree with existing research and support the general conclusion that an increase in per-person wealth is associated with longer lifespans. An important thing to note is that our model uses a log transformation of GDP, meaning that smaller economies derive more benefit from the same dollar increase in GDP per capita than larger economies.
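The diminishing return implied by the log transformation can be made concrete. The sketch below uses the rounded slope from our regression table; the starting income levels and the helper `gain_from_increase()` are illustrative choices, not part of the original analysis.

```r
# Slope on ln(GDP per capita), rounded from the report's regression table.
b1 <- 7.669

# Predicted change in life expectancy (years) from adding `delta`
# international dollars to a starting GDP per capita of `gdp`.
gain_from_increase <- function(gdp, delta = 1000) b1 * log((gdp + delta) / gdp)

# Illustrative starting income levels.
starting_gdp <- c(1000, 10000, 50000)
gains <- gain_from_increase(starting_gdp)

data.frame(starting_gdp, gain_years = round(gains, 2))
```

Under the fitted model, an extra $1,000 per person is associated with about 5.3 years at a starting level of $1,000, about 0.73 years at $10,000, and about 0.15 years at $50,000.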
There are limitations in our model given its simplicity. Reverse causation is possible: when people live longer, they are productive members of society for longer. Additionally, there are likely omitted variables, such as investment or geopolitical stability, that could help explain more of the variance in the data. These provide opportunities for further research into additional ways to increase life expectancy.